Abstract

The least absolute shrinkage and selection operator (LASSO) for linear regression exploits the geometric interplay of the $\ell_2$-data error objective and the $\ell_1$-norm constraint to arbitrarily select sparse models. Guiding this uninformed selection process with sparsity models has been precisely the center of attention over the last decade in order to improve learning performance. To this end, we alter the selection process of LASSO to explicitly leverage combinatorial sparsity models (CSMs) via the combinatorial selection and least absolute shrinkage (CLASH) operator. We provide concrete guidelines on how to leverage combinatorial constraints within CLASH, and characterize CLASH's guarantees as a function of the set restricted isometry constants of the sensing matrix. Finally, our experimental results show that CLASH can outperform both LASSO and model-based compressive sensing in sparse estimation.
I. Introduction

The least absolute shrinkage and selection operator (LASSO) is the de facto standard algorithm for regression [1]. LASSO estimates sparse linear models by minimizing the empirical data error via:

$$\hat{x}_{\text{LASSO}} = \arg\min \left\{ \|y - \Phi x\|_2^2 : \|x\|_1 \le \lambda \right\}, \tag{1}$$

where $\|\cdot\|_r$ is the $\ell_r$-norm. In (1), $\Phi \in \mathbb{R}^{m \times n}$ is the sensing matrix, $y \in \mathbb{R}^m$ are the responses (or observations), $x \in \mathbb{R}^n$ is the loading vector, and $\lambda > 0$ governs the sparsity of the solution. Along with many efficient algorithms for its solution, the LASSO formulation is now backed with a rather mature theory for the generalization of its solutions as well as its variable selection consistency [2]-[5].

While the long name attributed to (1) is apropos,$^1$ it does not capture the LASSO's arbitrariness in subset selection via shrinkage to best explain the responses. In fact, this uninformed selection process not only prevents interpretability of results in many problems, but also fails to exploit key prior information that could radically improve learning performance. Based on this premise, approaches to guide the selection process of the LASSO are now aplenty.

Surprisingly, while the prior information in many regression problems generates fundamentally discrete constraints (e.g., on the sparsity patterns or the support of the LASSO solution), the majority of the existing approaches that enforce such constraints in selection are inherently continuous. For instance, a prevalent approach is to tailor a sparsity-inducing norm to the constraints on the support set (cf. [6]). That is, we create a structured convex norm by mixing basic norms with weights over pre-defined groups or by using the Lovász extension of non-decreasing submodular set functions of the support. As many basic norms have well-understood behavior in sparse selection, reverse engineering such norms is quite intuitive.

While such structure-inducing, convex norm-based approaches on the LASSO are impressive, our contention in this paper is that, in order to truly make an impact in structured sparsity problems, we must fully leverage explicitly combinatorial approaches to guide LASSO's subset selection process. To achieve this, we show how Euclidean projections with structured sparsity constraints correspond to an integer linear program (ILP), which can be exactly or approximately solved subject to matroid constraints (via the greedy algorithm) and certain linear inequality constraints (via convex relaxation or multi-knapsack solvers). A key actor in this process is a polynomial-time combinatorial algorithm that goes beyond simple selection heuristics towards provable solution quality as well as runtime/space bounds.

Furthermore, we introduce our combinatorial selection and least absolute shrinkage (CLASH) operator and theoretically characterize its estimation guarantees. CLASH enhances the model-based compressive sensing (model-CS) framework [7] by combining $\ell_1$-norm and combinatorial constraints on the regression vector. Therefore, CLASH uses a combination of shrinkage and hard thresholding operations to significantly outperform the model-CS approach, LASSO, or continuous structured sparsity approaches in learning performance of sparse linear models. Furthermore, CLASH establishes a regression framework where the underlying tractability of approximation in combinatorial selection is directly reflected in the algorithm's estimation and convergence guarantees.

This work was supported in part by the European Commission under Grant MIRG-268398, ERC Future Proof, and DARPA KeCoM program #11-DARPA-1055. VC also would like to acknowledge Rice University for his Faculty Fellowship.

$^1$Many of the optimization solutions to LASSO leverage shrinkage operations (e.g., as projections onto the $\ell_1$-ball) for sparse model selections. However, the geometric interplay of the $\ell_2$-data error objective and the $\ell_1$-norm constraint inherently promotes sparsity, independent of the algorithm.



The organization of the paper is as follows. In Sections II and III, we set up the notation and the exact projections with structured sparsity constraints. We develop CLASH in Section IV and highlight the key components of its convergence proof in Section V. We present numerical results in Section VI. We provide our conclusions in Section VII.



II. Preliminaries 

Notation: We use $[x]_j$ to denote the $j$-th element of $x$, and let $x_i$ represent the $i$-th iterate of CLASH. The index set of $n$ dimensions is denoted as $\mathcal{N} = \{1, 2, \ldots, n\}$. Given $S \subseteq \mathcal{N}$, we define the complement set $S^c = \mathcal{N} \setminus S$. Moreover, given a set $S \subseteq \mathcal{N}$ and a vector $x \in \mathbb{R}^n$, $(x)_S \in \mathbb{R}^n$ denotes a vector with the following properties: $[(x)_S]_S = [x]_S$ and $[(x)_S]_{S^c} = 0$. The support set of $x$ is defined as $\mathrm{supp}(x) = \{i : [x]_i \ne 0\}$. We use $|S|$ to denote the cardinality of the set $S$. The empirical data error is denoted as $f(x) = \|y - \Phi x\|_2^2$, with gradient defined as $\nabla f(x) = -2\Phi^T (y - \Phi x)$, where $^T$ is the transpose operation. The notation $\nabla_S f(x)$ is shorthand for $(\nabla f(x))_S$. $I$ represents the identity matrix.

Combinatorial notions of sparsity: We provide some definitions on combinatorial sparse models, and elaborate on a subset 
of interesting models with algorithmic implications. 

Definition 1 (Combinatorial sparsity models (CSMs)). We define a combinatorial sparsity model $\mathcal{C}_k = \{S_m : \forall m,\ S_m \subseteq \mathcal{N},\ |S_m| \le k\}$ with the sparsity parameter $k$ as a collection of distinct index subsets $S_m$.

Throughout the paper, we assume that any CSM $\mathcal{C}_k$ is downward compatible, i.e., removing any subset of indices from any given element of $\mathcal{C}_k$ yields a set that is still in $\mathcal{C}_k$.

Properties of the regression matrix: Deriving approximation guarantees for CLASH behooves us to assume the restricted isometry property (RIP) (defined below) on the regression matrix $\Phi$ [8]. While the RIP and other similar conditions for deriving consistency properties of LASSO and its variants, such as the unique/exact representation property or the irrepresentable condition [5], [6], [9]-[11], are unverifiable a priori without exhaustive search, many random matrices satisfy them with high probability.

Definition 2 (RIP [7], [8]). The regression matrix has the $k$-RIP with an isometry constant $\delta_k$ when

$$(1 - \delta_k)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \delta_k)\|x\|_2^2, \tag{2}$$

$\forall\, \mathrm{supp}(x) \in \mathcal{C}_k$, where $\delta_k = \max_{S \in \mathcal{C}_k} \|\Phi_S^T \Phi_S - I\|_{2 \to 2}$, and $\Phi_S$ is a submatrix of $\Phi$ as column-indexed by $S$.

Here, we also comment on the scaling of $(k, m, n)$ for the desired level of isometry. When the entries of $\Phi$ can be modeled as independent and identically distributed (iid) with respect to a sub-Gaussian distribution, we can show that $m = O\left(\delta_k^{-2}\left(\log(2M) + k \log(12\delta_k^{-1})\right)\right)$ observations suffice with overwhelming probability [7]. Here, $M$ is the minimum number of subspaces covering $\mathcal{C}_k$. While $m$ explicitly depends on $n$, for certain restricted CSMs, such as the rooted connected tree of [7], this dependence can be quite weak, e.g., $m = O(k)$.
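To get a feel for how the model size $M$ enters this scaling, the following minimal sketch evaluates the bracketed term $\delta_k^{-2}(\log(2M) + k\log(12/\delta_k))$ for the unstructured $k$-sparse model, where $M = \binom{n}{k}$. The function name `measurement_scaling` and the chosen value of `delta` are illustrative assumptions; the hidden constant of the $O(\cdot)$ bound is not specified here, so only relative comparisons are meaningful.

```python
import math

def measurement_scaling(num_subspaces, k, delta):
    """Scaling term delta^-2 * (log(2M) + k*log(12/delta)), up to an unspecified constant."""
    return (math.log(2 * num_subspaces) + k * math.log(12.0 / delta)) / delta ** 2

# Assumed illustrative values: n and k as in Experiment 1, delta chosen arbitrarily.
n, k, delta = 800, 89, 0.3
M_uniform = math.comb(n, k)  # number of k-dimensional canonical subspaces

# A model with far fewer subspaces (here the extreme case M = 1) needs a smaller term.
print(measurement_scaling(M_uniform, k, delta))
print(measurement_scaling(1, k, delta))
```

Structured models with small $M$ reduce only the $\log(2M)$ part; the $k\log(12/\delta_k)$ part remains.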

III. Exact and approximate projections onto CSMs 

The workhorse of the model-CS approach is the following non-convex projection problem onto CSMs, as defined by $\mathcal{C}_k$, which is a basic subset selection problem:

$$\mathcal{P}_{\mathcal{C}_k}(x) = \arg\min_{w \in \mathbb{R}^n} \left\{ \|w - x\|_2^2 : \mathrm{supp}(w) \in \mathcal{C}_k \right\}, \tag{3}$$

where $\mathcal{P}_{\mathcal{C}_k}(x)$ is the projection operator. [7] shows that as long as $\mathcal{P}_{\mathcal{C}_k}(\cdot)$ is exactly computed in polynomial time for a CSM, their sparse recovery algorithms inherit strong approximation guarantees for that CSM. To better identify the CSMs that live within the model-CS assumptions, we first state the following key observation; the proof can be found in [12].

Lemma 1 (Euclidean projections onto CSMs). The support of the Euclidean projection onto $\mathcal{C}_k$ in (3) can be obtained as a solution to the following discrete optimization problem:

$$\mathrm{supp}(\mathcal{P}_{\mathcal{C}_k}(x)) = \arg\max_{S : S \in \mathcal{C}_k} F(S; x), \tag{4}$$

where $F(S; x) = \|x\|_2^2 - \|(x)_S - x\|_2^2 = \sum_{i \in S} |[x]_i|^2$ is the modular, variance reduction set function. Moreover, let $\hat{S} \in \mathcal{C}_k$ be the maximizer of the discrete problem. Then, it holds that $\mathcal{P}_{\mathcal{C}_k}(x) = (x)_{\hat{S}}$, which corresponds to hard thresholding.

The following proposition refines this observation to further accentuate the algorithmic implications for CSMs:

Proposition 1 (CSM projections via ILPs). The problem (4) is equivalent to the following integer linear program (ILP):

$$\mathrm{supp}(\mathcal{P}_{\mathcal{C}_k}(x)) = \mathrm{supp}\left( \arg\min_{z : [z]_i \in \{0,1\},\ \mathrm{supp}(z) \in \mathcal{C}_k} \left\{ w^T z : [w]_i = -|[x]_i|^2 \right\} \right), \tag{5}$$

where $[z]_i$, $(i = 1, \ldots, n)$, are support indicator variables.

The proof of Proposition 1 is straightforward and is omitted.
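As a small sanity check of the equivalence between the set-function form (4) and the indicator-vector form (5), the sketch below enumerates an explicitly listed CSM by brute force. The CSM (contiguous blocks of length at most two on four indices) and the function names are hypothetical choices for illustration; exhaustive enumeration is only sensible for tiny collections.

```python
def best_support_by_variance(x, csm):
    """Problem (4): maximize F(S; x) = sum_{i in S} |x_i|^2 over S in the CSM."""
    return max(csm, key=lambda S: sum(x[i] ** 2 for i in S))

def best_support_by_ilp_cost(x, csm):
    """Problem (5): minimize w^T z with [w]_i = -|x_i|^2 over indicator vectors z."""
    n = len(x)
    w = [-(xi ** 2) for xi in x]

    def cost(S):
        z = [1 if i in S else 0 for i in range(n)]  # support indicator of S
        return sum(wi * zi for wi, zi in zip(w, z))

    return min(csm, key=cost)

# Hypothetical downward-compatible CSM: contiguous blocks of length <= 2 (0-based indices).
csm = [(), (0,), (1,), (2,), (3,), (0, 1), (1, 2), (2, 3)]
x = [1.0, -2.0, 2.5, 0.5]

S_hat = best_support_by_variance(x, csm)
projection = [x[i] if i in S_hat else 0.0 for i in range(len(x))]  # hard thresholding
print(S_hat, projection)
```

Both selections agree because $w^T z = -F(\mathrm{supp}(z); x)$ for every indicator vector $z$.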

Regardless of whether we use a dynamic program, a greedy combinatorial algorithm, or an ILP solver, the formulations (4) or (5) make the underlying tractability of the combinatorial selection explicit. We highlight this notion via the polynomial-time modular $\epsilon$-approximation property (PMAP$_\epsilon$):

Definition 3 (PMAP$_\epsilon$ [12]). A CSM has the PMAP$_\epsilon$ with constant $\epsilon$ if the modular subset selection problem (4) or the ILP (5) admits an $\epsilon$-approximation scheme with polynomial or pseudo-polynomial time complexity as a function of $n$, $\forall x \in \mathbb{R}^n$. Denoting the $\epsilon$-approximate solution of (4) or (5) as $S_\epsilon$, this means $F(S_\epsilon; x) \ge (1 - \epsilon) \max_{S \in \mathcal{C}_k} F(S; x)$.

In this paper, we focus and elaborate on CSMs with PMAP$_0$.

A. Example CSMs with PMAP$_0$

Matroids: By matroid, we mean that $\mathcal{C}_k = (\mathcal{N}, \mathcal{I})$ is a finite collection of subsets of $\mathcal{N}$ that satisfies three conditions: (i) $\emptyset \in \mathcal{I}$, (ii) if $S$ is in $\mathcal{I}$, then any subset of $S$ is also in $\mathcal{I}$, and (iii) for $S_1, S_2 \in \mathcal{I}$ and $|S_1| > |S_2|$, there is an element $s \in S_1 \setminus S_2$ such that $S_2 \cup \{s\}$ is in $\mathcal{I}$. As a simple example, the unstructured sparsity model (i.e., $x$ is $k$-sparse) forms a uniform matroid as it is defined as the union of all subsets of $\mathcal{N}$ with cardinality $k$ or less. When $\mathcal{C}_k$ forms a matroid, the greedy basis algorithm can efficiently compute (3) by solving (4) [13]; for the uniform matroid, sorting and selecting the $k$ largest elements in absolute value is sufficient to obtain the exact projection.

Moreover, it turns out that this particular perspective provides a principled and tractable approach to encode an interesting class of matroid-structured sparsity models. The recipe is quite simple: we seek the intersection of a structure provider matroid (e.g., partition, cographic/graphic, disjoint path, or matching matroid) with the sparsity provider uniform matroid. While the intersection of two matroids is not a matroid in general, we can prove that the intersection of the uniform matroid with any other matroid satisfies the conditions above.
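For the uniform matroid, the sort-and-select rule can be sketched in a few lines. The function names below are illustrative and not taken from the paper's released code; ties in magnitude are broken by index, which is one arbitrary but deterministic choice.

```python
def project_k_sparse(x, k):
    """Exact projection onto k-sparse vectors: keep the k largest entries in magnitude."""
    # Sort indices by decreasing magnitude; break ties by smaller index.
    order = sorted(range(len(x)), key=lambda i: (-abs(x[i]), i))
    support = sorted(order[:k])
    projected = [x[i] if i in support else 0.0 for i in range(len(x))]
    return projected, support

def variance_reduction(x, support):
    """Modular set function F(S; x) = sum_{i in S} |x_i|^2."""
    return sum(x[i] ** 2 for i in support)

x = [0.5, -3.0, 1.0, 2.0, -0.1]
x_proj, S_hat = project_k_sparse(x, k=2)
print(x_proj, S_hat, variance_reduction(x, S_hat))
```

Here the selected support contains the entries $-3$ and $2$, so $F(\hat{S}; x) = 9 + 4 = 13$, which no other pair of indices exceeds.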

Linear support constraints: Many interesting CSMs $\mathcal{C}_k$ can be encoded using linear support constraints of the form:

$$\mathcal{C}_k = \bigcup_{\forall z \in \mathcal{Z}} \mathrm{supp}(z), \qquad \mathcal{Z} := \left\{ [z]_i \in \{0, 1\} : Az \le b \right\},$$

where $[A, b]$ is an integral matrix, the first row of $A$ is all 1's, and $[b]_1 = k$. As a basic example, the neuronal spike model of [14] is based on linear support constraints where each spike respects a minimum refractory distance to each other.

A key observation is that if each of the nonempty faces of $\mathcal{Z}$ contains an integral point (i.e., forming an integral polyhedron), then convex relaxation methods can exactly obtain the correct integer solutions in polynomial time. In general, checking the integrality of $\mathcal{Z}$ is NP-hard. However, if $\mathcal{Z}$ is integral and non-empty for all integral $b$, then a necessary condition is that $A$ be a totally unimodular (TU) matrix [13]. A matrix is totally unimodular if the determinant of each square submatrix is equal to 0, 1, or -1. Example TU matrices include interval, perfect, and network matrices [13]. As expected, the constraint matrix $A$ of [14] is TU. Moreover, it is easy to verify that the sparse disjoint group model of [15] also defines a TU constraint, where groups have individual sparsity budgets.
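The TU definition can be checked directly on small constraint matrices by enumerating every square submatrix. The sketch below is a brute-force illustration only (its cost grows exponentially with the matrix size); the example matrices and function names are assumptions made for this illustration, with the first matrix mimicking a sparsity row of all ones followed by interval (consecutive-ones) rows.

```python
from itertools import combinations

def determinant(M):
    """Exact integer determinant via Laplace expansion (fine for tiny matrices)."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * M[0][j] * determinant(minor)
    return total

def is_totally_unimodular(A):
    """Return True if every square submatrix of A has determinant in {-1, 0, 1}."""
    rows, cols = len(A), len(A[0])
    for size in range(1, min(rows, cols) + 1):
        for r in combinations(range(rows), size):
            for c in combinations(range(cols), size):
                sub = [[A[i][j] for j in c] for i in r]
                if determinant(sub) not in (-1, 0, 1):
                    return False
    return True

# Interval matrix: the ones in every row are consecutive.
interval = [[1, 1, 1, 1],
            [1, 1, 0, 0],
            [0, 1, 1, 0],
            [0, 0, 1, 1]]
# Odd-cycle incidence pattern with determinant 2, hence not TU.
not_tu = [[1, 1, 0],
          [0, 1, 1],
          [1, 0, 1]]

print(is_totally_unimodular(interval), is_totally_unimodular(not_tu))
```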

B. How about PMAP$_\epsilon$?

For completeness and due to lack of space, we only mention PMAP$_\epsilon$, which extends the breadth of the model-CS approach. For a detailed treatment of PMAP$_\epsilon$ and CLASH, cf. [12], which describes multi-knapsack CSMs as a concrete example. Moreover, for many of the PMAP$_0$ examples above, we can employ $\epsilon$-approximate (randomized) techniques to reduce computational cost.

IV. The Clash algorithm 

The new CLASH algorithm obtains approximate solutions to the LASSO problem in (1) with the added twist that the solution must live within the CSM, as defined by $\mathcal{C}_k$:

$$\hat{x}_{\text{CLASH}} = \arg\min \left\{ f(x) : \|x\|_1 \le \lambda,\ \mathrm{supp}(x) \in \mathcal{C}_k \right\}. \tag{6}$$

When available, using the CSM constraint $\mathcal{C}_k$ in addition to the $\ell_1$-norm constraint enhances learning in two important ways. First, the combinatorial constraints restrict the LASSO solution to exhibit true model-based supports, increasing the interpretability of the solution without relaxing $\mathcal{C}_k$ into a convex norm. Second, it empirically requires far fewer samples to obtain the true solution than both the LASSO and the model-CS approaches.

We provide pseudo-code for an example implementation of CLASH in Algorithm 1. One can think of alternative ways of implementing CLASH, such as single gradient updates in Step 2, or removing Step 4 altogether. While such changes may lead to different (possibly better) approximation guarantees for the solution of (6), we observe degradation in the empirical performance of the algorithm as compared to this implementation, whose guarantees are as follows:



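Algorithm 1: CLASH Algorithm

Input: $y$, $\Phi$, $\lambda$, $\mathcal{P}_{\mathcal{C}_k}$, Tolerance $\eta$, MaxIterations
Initialize: $x_0 \leftarrow 0$, $\mathcal{X}_0 \leftarrow \{\emptyset\}$, $i \leftarrow 0$
repeat
  1: $S_i \leftarrow \mathrm{supp}\left(\mathcal{P}_{\mathcal{C}_k}\left(\nabla_{\mathcal{X}_i^c} f(x_i)\right)\right) \cup \mathcal{X}_i$
  2: $v_i \leftarrow \arg\min_{v : \|v\|_1 \le \lambda,\ \mathrm{supp}(v) \in S_i} \|y - \Phi v\|_2^2$
  3: $\gamma_i \leftarrow \mathcal{P}_{\mathcal{C}_k}(v_i)$ with $\Gamma_i \leftarrow \mathrm{supp}(\gamma_i)$
  4: $x_{i+1} \leftarrow \arg\min_{v : \|v\|_1 \le \lambda,\ \mathrm{supp}(v) \in \Gamma_i} \|y - \Phi v\|_2^2$
  $\mathcal{X}_{i+1} \leftarrow \mathrm{supp}(x_{i+1})$
  $i \leftarrow i + 1$
until $\|x_i - x_{i-1}\|_2 \le \eta \|x_i\|_2$ or MaxIterations.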
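The four steps of Algorithm 1 can be sketched for the simplest CSM, the uniform $k$-sparse model (the sparse-CLASH setting). This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the inner $\ell_1$-constrained least squares problems of Steps 2 and 4 are solved here with a plain projected gradient loop and a sort-based $\ell_1$-ball projection, which are choices made for this sketch, and all function names are hypothetical.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection onto {u : ||u||_1 <= radius} using the sort-based method."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * idx > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def restricted_l1_least_squares(Phi, y, support, lam, iters=200):
    """Projected gradient for min ||y - Phi v||_2^2 s.t. ||v||_1 <= lam, supp(v) in support."""
    x = np.zeros(Phi.shape[1])
    if len(support) == 0:
        return x
    A = Phi[:, support]
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L for the gradient A^T (A v - y)
    v = np.zeros(len(support))
    for _ in range(iters):
        v = project_l1_ball(v - step * (A.T @ (A @ v - y)), lam)
    x[support] = v
    return x

def top_k_support(x, k):
    """Support of the exact projection onto the uniform k-sparse model."""
    return np.argsort(-np.abs(x))[:k]

def clash(Phi, y, lam, k, tol=1e-6, max_iter=50):
    x = np.zeros(Phi.shape[1])
    for _ in range(max_iter):
        grad = -2.0 * Phi.T @ (y - Phi @ x)
        active = np.flatnonzero(x)
        grad[active] = 0.0  # keep the gradient only on the complement of the support
        S = np.union1d(top_k_support(grad, k), active).astype(int)   # Step 1
        v = restricted_l1_least_squares(Phi, y, S, lam)              # Step 2
        Gamma = np.sort(top_k_support(v, k))                         # Step 3
        x_new = restricted_l1_least_squares(Phi, y, Gamma, lam)      # Step 4
        done = np.linalg.norm(x_new - x) <= tol * np.linalg.norm(x_new)
        x = x_new
        if done:
            break
    return x
```

A call such as `clash(Phi, y, lam, k)` returns an iterate whose support has at most `k` entries and whose $\ell_1$-norm does not exceed `lam` (up to floating-point rounding), since the last operation is always an $\ell_1$-ball projection on a support of size `k`. Replacing `top_k_support` with a model-specific projection would give a model-CLASH variant.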



Theorem 1 (Iteration invariant). Let $x^* \in \mathbb{R}^n$ be the true vector that satisfies the constraints of (6) and let $y = \Phi x^* + \varepsilon$ be the set of observations with additive error $\varepsilon \in \mathbb{R}^m$. Then, the $i$-th iterate $x_i$ of CLASH satisfies the following recursion:

$$\|x_{i+1} - x^*\|_2 \le \rho \|x_i - x^*\|_2 + c_1(\delta_{2k}, \delta_{3k}) \|\varepsilon\|_2,$$

where $\rho \triangleq \frac{\delta_{3k} + \delta_{2k}}{\sqrt{1 - \delta_{2k}^2}} \sqrt{\frac{1 + 3\delta_{3k}^2}{1 - \delta_{3k}^2}}$ and $c_1(\delta_{2k}, \delta_{3k})$ is a constant defined in [16]. The iterations contract when $\delta_{3k} < 0.3658$.

Theorem 1 shows that the isometry requirements of CLASH are competitive with the mainstream hard thresholding methods such as CoSaMP [17] and Subspace Pursuit [18], even though it incorporates the $\ell_1$-norm constraints, which, as Section VI illustrates, improve learning performance.
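The contraction threshold can be checked numerically by multiplying the noise-free factors that the lemmas of Section V contribute when $\epsilon = 0$: $(\delta_{3k} + \delta_{2k})$ from active set expansion, $1/\sqrt{1 - \delta_{3k}^2}$ from greedy descent, $\sqrt{1 + 3\delta_{3k}^2}$ from combinatorial selection, and $1/\sqrt{1 - \delta_{2k}^2}$ from de-biasing. The sketch below assumes the worst case $\delta_{2k} = \delta_{3k}$ (recall $\delta_{2k} \le \delta_{3k}$); the function name is illustrative.

```python
import math

def contraction_factor(delta_2k, delta_3k):
    """Product of the noise-free per-step factors for exact projections (epsilon = 0)."""
    expansion = delta_3k + delta_2k                      # active set expansion (Step 1)
    descent = 1.0 / math.sqrt(1.0 - delta_3k ** 2)       # greedy descent (Step 2)
    selection = math.sqrt(1.0 + 3.0 * delta_3k ** 2)     # combinatorial selection (Step 3)
    debias = 1.0 / math.sqrt(1.0 - delta_2k ** 2)        # de-bias (Step 4)
    return expansion * descent * selection * debias

for delta in (0.30, 0.3658, 0.37):
    print(delta, contraction_factor(delta, delta))
```

The factor stays below one at $\delta_{3k} = 0.3658$ and exceeds one at $0.37$, in line with the stated threshold.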

Remark 1. [Model mismatch and selection] Let us assume a generative model $y = \Phi\beta + \varepsilon$. Let $x^*$ be the best approximation of $\beta$ in $\mathcal{C}_k$ within the $\ell_1$-ball of radius $\lambda$. Then, we can show that the iteration invariant of Theorem 1 still holds with $\mathrm{SNR} \sim \|x^*\|_2 / \|e\|_2$, where $\|e\|_2 \le \|\varepsilon\|_2 + \|\Phi(\beta - x^*)\|_2$, and the latter quantity (the impact of mismatch) can be analyzed using the restricted amplification property of $\Phi$ [7]. For instance, when $\mathcal{C}_k$ is the uniform sparsity model, then $\|\Phi(\beta - x^*)\|_2 \le \sqrt{1 + \delta_k} \left( \|\beta - x^*\|_2 + \frac{\|\beta - x^*\|_1}{\sqrt{k}} \right)$, which should presumably be small if the model is selected correctly.



In the absence of prior information, we automate the parameter selection by using the Donoho-Tanner phase transition [19] to choose the maximum $k$ allowed for a given $(m, n)$-pair, and then by using cross validation to pick $\lambda$ [20].

V. Proof of Theorem 1

We sketch the proof of Theorem 1 à la [17] and [21], assuming the general case of PMAP$_\epsilon$. The details of the proof can be found in an extended version of the paper [16].

Lemma 2 (Active set expansion - Step 1). The support set $S_i$, where $|S_i| \le 2k$, identifies a subspace in $\mathcal{C}_{2k}$ such that:

$$\|(x_i - x^*)_{S_i^c}\|_2 \le \left(\delta_{3k} + \delta_{2k} + \sqrt{\epsilon}(1 + \delta_{2k})\right) \|x_i - x^*\|_2 + \left(\sqrt{2(1 + \delta_{3k})} + \sqrt{\epsilon(1 + \delta_{2k})}\right) \|\varepsilon\|_2. \tag{7}$$

Lemma 2 states that, at each iteration, Step 1 of CLASH identifies a $2k$ support set such that the unrecovered energy of $x^*$ is bounded. For $\epsilon = 0$, CLASH exactly identifies the support where the projected gradient onto $\mathcal{C}_k$ can make the most impact on the loading vector in the support complement of its current solution; the two supports are subsequently merged together.

Lemma 3 (Greedy descent with least absolute shrinkage - Step 2). Let $S_i$ be a $2k$-sparse support set. Then, the least squares solution $v_i$ in Step 2 of Algorithm 1 satisfies

$$\|v_i - x^*\|_2 \le \frac{1}{\sqrt{1 - \delta_{3k}^2}} \|(x_i - x^*)_{S_i^c}\|_2 + \frac{\sqrt{1 + \delta_{2k}}}{1 - \delta_{3k}} \|\varepsilon\|_2.$$

We borrow the proof of Lemma 3 from [21]. This step improves the objective function $f(x)$ as much as possible on the active set in order to arbitrate the active set. The solution simultaneously satisfies the $\ell_1$-norm constraint.

Step 3 projects the solution onto $\mathcal{C}_k$, whose action is characterized by the following lemma. Here, we show the $\epsilon$-approximate projection explicitly:

Lemma 4 (Combinatorial selection - Step 3). Let $v_i$ be a $2k$-sparse proxy vector with indices in support set $S_i$, $\mathcal{C}_k$ be a CSM and $\gamma_i$ the projection of $v_i$ under $\mathcal{C}_k$. Then:

$$\|\gamma_i - v_i\|_2^2 \le (1 - \epsilon) \|(v_i - x^*)_{S_i}\|_2^2 + \epsilon \|v_i\|_2^2.$$



Step 4 requires the following Corollary to Lemma 3:

Corollary 1 (De-bias - Step 4). Let $\Gamma_i$ be the support set of a proxy vector $\gamma_i$ where $|\Gamma_i| \le k$. Then, the least squares solution $x_{i+1}$ in Step 4 satisfies

$$\|x_{i+1} - x^*\|_2 \le \frac{1}{\sqrt{1 - \delta_{2k}^2}} \|\gamma_i - x^*\|_2 + \frac{\sqrt{1 + \delta_k}}{1 - \delta_{2k}} \|\varepsilon\|_2.$$

Step 4 de-biases the current result on the putative solution support. Its characterization connects Lemmas 3 and 4:

Lemma 5. Let $v_i$ be the least squares solution of the greedy descent step (Step 2) and $\gamma_i$ be a proxy vector to $v_i$ after applying the combinatorial selection step. Then, $\|\gamma_i - x^*\|_2$ can be expressed in terms of the distance from $v_i$ to $x^*$ as follows:

$$\|\gamma_i - x^*\|_2 \le \sqrt{1 + \left((1 - \epsilon) + 2\sqrt{1 - \epsilon}\right)\delta_{3k}^2 + 2\delta_{3k}\sqrt{\epsilon} + \epsilon} \cdot \|v_i - x^*\|_2 + D_1 \|\varepsilon\|_2 + D_2 \|x^*\|_2 + D_3 \sqrt{\|x^*\|_2 \|\varepsilon\|_2}, \tag{8}$$

where $D_1$, $D_2$, $D_3$ are constants depending on $\epsilon$, $\delta_{2k}$, $\delta_{3k}$.

Finally, the proof of Theorem 1 follows by concatenating Corollary 1 with Lemmas 2, 3 and 5, and setting $\epsilon = 0$.

VI. Experiments 

In the following experiments, we compare algorithms from the following list: (i) the LASSO algorithm [1], (ii) the Basis Pursuit DeNoising (BPDN) [22], (iii) the sparse-CLASH algorithm, where $\mathcal{C}_k$ is the index set of $k$-sparse signals, (iv) the model-CLASH algorithm,$^2$ which explicitly carries $\mathcal{C}_k$, and (v) the Subspace Pursuit (SP) algorithm [18], as integrated with the model-CS approach. We emphasize here that when $\lambda \to \infty$ in (6), CLASH must converge to the model-based SP solution.

The LASSO algorithm finds a solution to the problem defined in (1), where we use a Nesterov accelerated projected gradient algorithm. The BPDN algorithm in turn solves the following optimization problem:

$$\hat{x}_{\text{BPDN}} = \arg\min \left\{ \|x\|_1 : \|\Phi x - y\|_2 \le \sigma \right\}, \tag{9}$$

where $\sigma$ represents prior knowledge on the energy of the additive noise term. To solve (9), we use the spectral projected gradient method SPGL1 algorithm [23].

In the experiments below, the nonzero coefficients of $x^*$ are generated iid according to the standard normal distribution with $\|x^*\|_2 = 1$. The BPDN algorithm is given the true $\sigma$ values. While CLASH is given the true value of $k$ for the experiments below, additional experiments (not shown) show that our phase transition heuristic is quite good and the mismatch is graceful
as indicated in Remark I. All the algorithms use a high precision stopping tolerance r/ = 10^^. 

Experiment 1: Improving simple sparse recovery. In this experiment, we generate random realizations of the model $y = \Phi x^* + \varepsilon$ for $n = 800$. Here, $\Phi$ is a dense random matrix whose entries are iid Gaussian with zero mean and variance $1/m$. We consider two distinct generative model settings: (i) with additive Gaussian white noise with $\|\varepsilon\|_2 = 0.05$, $m = 240$ and $k = 89$, and (ii) the noiseless model ($\|\varepsilon\|_2 = 0$), $m = 250$ and sparsity parameter $k = 93$. For this experiment, we perform 500 Monte Carlo model realizations.
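One realization of this generative model can be sketched as follows. The uniformly random choice of the support and the rescaling of a Gaussian noise vector to the prescribed norm are assumptions of this sketch; the function name `generate_problem` is illustrative and not part of the released code.

```python
import numpy as np

def generate_problem(n, m, k, noise_norm, seed=0):
    """Draw (Phi, x_true, y) with y = Phi x_true + noise, as described for Experiment 1."""
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((m, n)) / np.sqrt(m)    # iid N(0, 1/m) entries
    x_true = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)    # assumed: uniformly random support
    x_true[support] = rng.standard_normal(k)          # iid standard normal nonzeros
    x_true /= np.linalg.norm(x_true)                  # normalize so ||x*||_2 = 1
    if noise_norm > 0:
        noise = rng.standard_normal(m)
        noise *= noise_norm / np.linalg.norm(noise)   # rescale to the target noise energy
    else:
        noise = np.zeros(m)
    return Phi, x_true, Phi @ x_true + noise

# Setting (i): noisy case with n = 800, m = 240, k = 89.
Phi, x_true, y = generate_problem(n=800, m=240, k=89, noise_norm=0.05)
print(Phi.shape, np.count_nonzero(x_true), np.linalg.norm(y - Phi @ x_true))
```

Repeating the call with different seeds gives independent Monte Carlo realizations.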

We sweep $\lambda$ and illustrate the recovery performance of CLASH (6). Figures 1(a)-(b) illustrate that the combination of hard thresholding with norm constraints can improve the signal recovery performance significantly over convex-only and hard thresholding-only methods, both in noisy and noiseless problem settings. For $\|\varepsilon\|_2 = 0$, CLASH perfectly recovers the signal
when $\lambda$ is close to the true value. When $\lambda \ll \|x^*\|_1$, the performance degrades due to the large norm mismatch.

Experiment 2: Improving structured sparse recovery. We consider two signal CSMs: in the first model, we assume $k$-sparse signals that admit clustered sparsity with coefficients in $C$-contiguous blocks on an undirected, acyclic chain graph [7]. Without loss of generality, we use $C = 5$ (Figure 1(c)). The second model corresponds to a TU system where we partition the $k$-sparse signals into uniform blocks and force sparsity constraints on individual blocks; in this case, we solve the set optimization problem optimally via linear programming relaxation (Figure 1(d)). Here, the noise energy level satisfies $\|\varepsilon\|_2 = 0.05$, and $n = 500$, $m = 125$, and $k = 50$. In both cases, we conduct 100 Monte Carlo iterations and perform sparse estimation for a range of $\lambda$ values.

In Figure 1(c), we observe that clustered sparsity structure provides a distinct advantage in reconstruction compared to the LASSO formulation and the sparse-CLASH algorithm. Furthermore, note that when $\lambda$ is large, norm constraints have no effect and the model-CLASH provides essentially the same results as the model-CS approach [7]. On the other hand, the sparse-CLASH improves significantly beyond the LASSO solution thanks to the $\ell_1$-norm constraint.

In Figure 1(d) however, the situation is radically changed: while the TU constraints enhance the reconstruction of the model-CS approach over simple sparse recovery, the improvement becomes quite large as the $\ell_1$-norm constraint kicks in. We also observe the improvement in sparse-CLASH but it is not as accentuated as in the model-CLASH.



$^2$CLASH codes are available for MATLAB at http://lions.epfl.ch/CLASH




[Figure 1, panels (a)-(d): signal error curves for SP, LASSO, BPDN, sparse-CLASH and model-CLASH; the surviving panel annotations for (c) and (d) read $\|\varepsilon\|_2 = 0.05$.]

Fig. 1. Median values of signal error $\|\hat{x} - x^*\|_2$. Top row: simple sparsity model under noisy $\|\varepsilon\|_2 = 0.05$ (left column) and noiseless $\|\varepsilon\|_2 = 0$ (right column) settings. Bottom row: the $(k, C)$-clustered sparsity model (left column) and the TU model (right column).



VII. Conclusions 

CLASH establishes a regression framework where efficient algorithms from combinatorial and convex optimization can interface for interpretable and model-based sparse solutions. Our experiments demonstrate that while the model-based combinatorial selection by itself can greatly improve sparse recovery over the approaches based on uniform sparsity alone, the shrinkage operations due to the $\ell_1$-constraint have an undeniable, positive impact on the learning performance. Understanding the tradeoffs between the complexity of approximation and the recovery guarantees of CLASH in this setting is a promising theoretical as well as practical direction.

Appendix 
A. Proof of Theorem 1 

A well-known lemma used in the convergence guarantee proof of CLASH is stated next. The proof is omitted.

Lemma 6 (Optimality condition). Let $\Theta \subseteq \mathbb{R}^n$ be a convex set and $f : \Theta \to \mathbb{R}$ be a smooth objective function defined over $\Theta$. Let $\psi^* \in \Theta$ be a local minimum of the objective function $f$ over the set $\Theta$. Then

$$\langle \nabla f(\psi^*), \psi - \psi^* \rangle \ge 0, \quad \forall \psi \in \Theta, \tag{10}$$

for all convex sets $\Theta$.

In the derivation of Theorem 1, we assume $x^* \in \mathbb{R}^n$ is the loading vector, $y \in \mathbb{R}^m$ is the set of observations, $\Phi \in \mathbb{R}^{m \times n}$ is the regression matrix and $\varepsilon = y - \Phi x^*$ represents the additive noise term. For clarity, we present the proof of Theorem 1 as a collection of lemmas.



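Lemma 7 (Active set expansion). The support set $S_i$, where $|S_i| \le 2k$, identifies a subspace in $\mathcal{C}_{2k}$ such that:

$$\|(x_i - x^*)_{S_i^c}\|_2 \le \left(\delta_{3k} + \delta_{2k} + \sqrt{\epsilon}(1 + \delta_{2k})\right)\|x_i - x^*\|_2 + \left(\sqrt{2(1 + \delta_{3k})} + \sqrt{\epsilon(1 + \delta_{2k})}\right)\|\varepsilon\|_2. \tag{11}$$

Proof: Let $\mathcal{X}_i \cup \mathcal{X}^*$ denote the union of the support sets of the current estimate $x_i$ and the signal of interest $x^*$. Then, the following sequence of inequalities holds true:

$$F(\mathcal{X}_i \cup \mathcal{X}^*; \nabla f(x_i)) \le F\left(\mathcal{X}_i \cup \mathrm{supp}\left(\mathcal{P}_{\mathcal{C}_k}(\nabla_{\mathcal{X}_i^c} f(x_i))\right); \nabla f(x_i)\right) \Rightarrow \tag{12}$$

$$(1 - \epsilon) F(\mathcal{X}_i \cup \mathcal{X}^*; \nabla f(x_i)) \le (1 - \epsilon) F\left(\mathcal{X}_i \cup \mathrm{supp}\left(\mathcal{P}_{\mathcal{C}_k}(\nabla_{\mathcal{X}_i^c} f(x_i))\right); \nabla f(x_i)\right). \tag{13}$$

Given that the support set $S_i$ is an $\epsilon$-approximate support set, from the definition of PMAP$_\epsilon$, (13) is further transformed into:

$$(1 - \epsilon) F(\mathcal{X}_i \cup \mathcal{X}^*; \nabla f(x_i)) \le F(S_i; \nabla f(x_i)). \tag{14}$$

Substituting the definition of the variance reduction modular function $F(S; x) = \|x\|_2^2 - \|(x)_S - x\|_2^2 = \|(x)_S\|_2^2$, we get:

$$(1 - \epsilon) \|\nabla_{\mathcal{X}_i \cup \mathcal{X}^*} f(x_i)\|_2^2 \le \|\nabla_{S_i} f(x_i)\|_2^2 \Rightarrow \tag{15}$$

$$(1 - \epsilon) \left\|\left(\Phi^*(y - \Phi x_i)\right)_{\mathcal{X}_i \cup \mathcal{X}^*}\right\|_2^2 \le \left\|\left(\Phi^*(y - \Phi x_i)\right)_{S_i}\right\|_2^2 \Rightarrow \tag{16}$$

$$\left\|\left(\Phi^*(y - \Phi x_i)\right)_{\mathcal{X}_i \cup \mathcal{X}^*}\right\|_2^2 \le \left\|\left(\Phi^*(y - \Phi x_i)\right)_{S_i}\right\|_2^2 + \epsilon \left\|\left(\Phi^*(y - \Phi x_i)\right)_{\mathcal{X}_i \cup \mathcal{X}^*}\right\|_2^2. \tag{17}$$

Using the subadditivity property of the square root function and excluding the common contribution $\left(\Phi^*(y - \Phi x_i)\right)_{(\mathcal{X}_i \cup \mathcal{X}^*) \cap S_i}$, we have:

$$\left\|\left(\Phi^*(y - \Phi x_i)\right)_{(\mathcal{X}_i \cup \mathcal{X}^*) \setminus S_i}\right\|_2 \le \left\|\left(\Phi^*(y - \Phi x_i)\right)_{S_i \setminus (\mathcal{X}_i \cup \mathcal{X}^*)}\right\|_2 + \sqrt{\epsilon} \left\|\left(\Phi^*(y - \Phi x_i)\right)_{\mathcal{X}_i \cup \mathcal{X}^*}\right\|_2 \tag{18}$$

$$\overset{(i)}{\le} \left\|\left(\Phi^*\Phi(x^* - x_i)\right)_{S_i \setminus (\mathcal{X}_i \cup \mathcal{X}^*)}\right\|_2 + \left\|(\Phi^*\varepsilon)_{S_i \setminus (\mathcal{X}_i \cup \mathcal{X}^*)}\right\|_2 + \sqrt{\epsilon}\left\|\left(\Phi^*\Phi(x^* - x_i)\right)_{\mathcal{X}_i \cup \mathcal{X}^*}\right\|_2 + \sqrt{\epsilon}\left\|(\Phi^*\varepsilon)_{\mathcal{X}_i \cup \mathcal{X}^*}\right\|_2 \tag{19}$$

$$\overset{(ii)}{=} \left\|\left((\Phi^*\Phi - I)(x^* - x_i)\right)_{S_i \setminus (\mathcal{X}_i \cup \mathcal{X}^*)}\right\|_2 + \left\|(\Phi^*\varepsilon)_{S_i \setminus (\mathcal{X}_i \cup \mathcal{X}^*)}\right\|_2 + \sqrt{\epsilon}\left\|\left(\Phi^*\Phi(x^* - x_i)\right)_{\mathcal{X}_i \cup \mathcal{X}^*}\right\|_2 + \sqrt{\epsilon}\left\|(\Phi^*\varepsilon)_{\mathcal{X}_i \cup \mathcal{X}^*}\right\|_2 \tag{20}$$

$$\overset{(iii)}{\le} \left(\delta_{3k} + \sqrt{\epsilon}(1 + \delta_{2k})\right)\|x_i - x^*\|_2 + \left\|(\Phi^*\varepsilon)_{S_i \setminus (\mathcal{X}_i \cup \mathcal{X}^*)}\right\|_2 + \sqrt{\epsilon}\left\|(\Phi^*\varepsilon)_{\mathcal{X}_i \cup \mathcal{X}^*}\right\|_2, \tag{21}$$

where (i) is obtained by applying the triangle inequality, (ii) holds since $(x^* - x_i)_{S_i \setminus (\mathcal{X}_i \cup \mathcal{X}^*)} = 0$, and (iii) is due to the Cauchy-Schwarz inequality and the isometry constant definition.

In addition, we can obtain a lower bound for $\left\|\left(\Phi^*(y - \Phi x_i)\right)_{(\mathcal{X}_i \cup \mathcal{X}^*) \setminus S_i}\right\|_2$:

$$\left\|\left(\Phi^*(y - \Phi x_i)\right)_{(\mathcal{X}_i \cup \mathcal{X}^*) \setminus S_i}\right\|_2 = \left\|\left(\Phi^*\Phi(x^* - x_i)\right)_{(\mathcal{X}_i \cup \mathcal{X}^*) \setminus S_i} + (\Phi^*\varepsilon)_{(\mathcal{X}_i \cup \mathcal{X}^*) \setminus S_i}\right\|_2 \tag{22}$$

$$= \left\|\left(\Phi^*\Phi(x^* - x_i)\right)_{(\mathcal{X}_i \cup \mathcal{X}^*) \setminus S_i} - (x^* - x_i)_{(\mathcal{X}_i \cup \mathcal{X}^*) \setminus S_i} + (x^* - x_i)_{(\mathcal{X}_i \cup \mathcal{X}^*) \setminus S_i} + (\Phi^*\varepsilon)_{(\mathcal{X}_i \cup \mathcal{X}^*) \setminus S_i}\right\|_2 \tag{23}$$

$$\ge \left\|(x^* - x_i)_{(\mathcal{X}_i \cup \mathcal{X}^*) \setminus S_i}\right\|_2 - \left\|\left((\Phi^*\Phi - I)(x^* - x_i)\right)_{(\mathcal{X}_i \cup \mathcal{X}^*) \setminus S_i}\right\|_2 - \left\|(\Phi^*\varepsilon)_{(\mathcal{X}_i \cup \mathcal{X}^*) \setminus S_i}\right\|_2 \tag{24}$$

$$\overset{(i)}{\ge} \left\|(x^* - x_i)_{(\mathcal{X}_i \cup \mathcal{X}^*) \setminus S_i}\right\|_2 - \delta_{2k}\|x^* - x_i\|_2 - \left\|(\Phi^*\varepsilon)_{(\mathcal{X}_i \cup \mathcal{X}^*) \setminus S_i}\right\|_2, \tag{25}$$

where (i) is obtained by using the Cauchy-Schwarz inequality and the isometry constant definition.

Since $\|(x_i - x^*)_{(\mathcal{X}_i \cup \mathcal{X}^*) \setminus S_i}\|_2 = \|(x_i - x^*)_{S_i^c}\|_2$, combining (21) and (25), we get:

$$\|(x_i - x^*)_{S_i^c}\|_2 \le \left(\delta_{3k} + \delta_{2k} + \sqrt{\epsilon}(1 + \delta_{2k})\right)\|x_i - x^*\|_2 + \left(\sqrt{2(1 + \delta_{3k})} + \sqrt{\epsilon(1 + \delta_{2k})}\right)\|\varepsilon\|_2, \tag{26}$$

as a consequence of the RIP inequality. ■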

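Lemma 8. [Greedy descent with least absolute shrinkage] Let $S_i$ be a $2k$-sparse support set. Then, the least squares solution $v_i$ given by:

$$v_i \leftarrow \arg\min_{v : \|v\|_1 \le \lambda,\ \mathrm{supp}(v) \subseteq S_i} \|y - \Phi v\|_2^2, \tag{27}$$

satisfies:

$$\|v_i - x^*\|_2 \le \frac{1}{\sqrt{1 - \delta_{3k}^2}} \|(x_i - x^*)_{S_i^c}\|_2 + \frac{\sqrt{1 + \delta_{2k}}}{1 - \delta_{3k}} \|\varepsilon\|_2. \tag{28}$$

Proof: We know that $\mathrm{supp}(v_i) \in S_i$. Starting from $\|v_i - x^*\|_2^2$, the following holds true:

$$\|v_i - x^*\|_2^2 = \|(v_i - x^*)_{S_i}\|_2^2 + \|(v_i - x^*)_{S_i^c}\|_2^2. \tag{29}$$

Using the optimality condition, $v_i$ is the minimizer of $\|y - \Phi v\|_2^2$ over the convex set $\Theta = \{v : \|v\|_1 \le \lambda,\ \mathrm{supp}(v) \in S_i\}$ and therefore:

$$\langle \nabla f(v_i), (x^* - v_i)_{S_i} \rangle \ge 0 \Rightarrow \langle \Phi v_i - y, \Phi(v_i - x^*)_{S_i} \rangle \le 0. \tag{30}$$

We calculate the following:

$$\|(v_i - x^*)_{S_i}\|_2^2 = \langle v_i - x^*, (v_i - x^*)_{S_i} \rangle \tag{31}$$

$$\le \langle v_i - x^*, (v_i - x^*)_{S_i} \rangle - \langle \Phi v_i - y, \Phi(v_i - x^*)_{S_i} \rangle \tag{32}$$

$$= \langle v_i - x^*, (v_i - x^*)_{S_i} \rangle - \langle \Phi v_i - \Phi x^* - \varepsilon, \Phi(v_i - x^*)_{S_i} \rangle \tag{33}$$

$$= \langle v_i - x^*, (v_i - x^*)_{S_i} \rangle - \langle v_i - x^*, \Phi^*\Phi(v_i - x^*)_{S_i} \rangle + \langle \varepsilon, \Phi(v_i - x^*)_{S_i} \rangle \tag{34}$$

$$= \langle v_i - x^*, (I - \Phi^*\Phi)(v_i - x^*)_{S_i} \rangle + \langle \varepsilon, \Phi(v_i - x^*)_{S_i} \rangle \tag{35}$$

$$\le \left|\langle v_i - x^*, (I - \Phi^*\Phi)(v_i - x^*)_{S_i} \rangle\right| + \langle \varepsilon, \Phi(v_i - x^*)_{S_i} \rangle \tag{36}$$

$$\overset{(i)}{\le} \delta_{3k}\|(v_i - x^*)_{S_i}\|_2 \|v_i - x^*\|_2 + \sqrt{1 + \delta_{2k}}\,\|(v_i - x^*)_{S_i}\|_2 \|\varepsilon\|_2, \tag{37}$$

where (i) comes from the Cauchy-Schwarz inequality and the isometry constant definition. Simplifying the above quadratic expression, we obtain:

$$\|(v_i - x^*)_{S_i}\|_2 \le \delta_{3k}\|v_i - x^*\|_2 + \sqrt{1 + \delta_{2k}}\,\|\varepsilon\|_2. \tag{38}$$

As a consequence, (29) can be upper bounded by:

$$\|v_i - x^*\|_2^2 \le \left(\delta_{3k}\|v_i - x^*\|_2 + \sqrt{1 + \delta_{2k}}\,\|\varepsilon\|_2\right)^2 + \|(v_i - x^*)_{S_i^c}\|_2^2. \tag{39}$$

We form the quadratic polynomial for this inequality, taking the quantity $\|v_i - x^*\|_2$ as the unknown variable. Bounding by the largest root of the resulting polynomial, and noting that $(v_i - x^*)_{S_i^c} = (x_i - x^*)_{S_i^c}$ because both $v_i$ and $x_i$ are supported on $S_i$, we get:

$$\|v_i - x^*\|_2 \le \frac{1}{\sqrt{1 - \delta_{3k}^2}} \|(x_i - x^*)_{S_i^c}\|_2 + \frac{\sqrt{1 + \delta_{2k}}}{1 - \delta_{3k}} \|\varepsilon\|_2. \tag{40}$$

■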



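Lemma 9. [Combinatorial selection] Let $v_i$ be a $2k$-sparse proxy vector with indices in support set $S_i$, $\mathcal{C}_k$ be a CSM and $\gamma_i$ the projection of $v_i$ under $\mathcal{C}_k$. Then:

$$\|\gamma_i - v_i\|_2^2 \le (1 - \epsilon)\|(v_i - x^*)_{S_i}\|_2^2 + \epsilon\|v_i\|_2^2. \tag{41}$$

Proof: Let $\gamma_i^{\text{opt}}$ denote the optimal combinatorial projection of $v_i$ under $\mathcal{C}_k$, i.e.,

$$\gamma_i^{\text{opt}} = \mathcal{P}_{\mathcal{C}_k}(v_i) = \arg\max_{(v_i)_S :\ S \subseteq \mathcal{N},\ S \in \mathcal{C}_k} F(S; v_i). \tag{42}$$

By the definition of the non-convex projection onto CSMs, it is apparent that:

$$\|\gamma_i^{\text{opt}} - v_i\|_2 \le \|(v_i - x^*)_{S_i}\|_2, \tag{43}$$

over $\mathcal{C}_k$ since $\gamma_i^{\text{opt}}$ is the best approximation to $v_i$ for that particular CSM. In the general case, this step is performed approximately and we get $\gamma_i$ as an $\epsilon$-approximate projection of $v_i$ with corresponding variance reduction $F(S_\epsilon; v_i)$. According to the definition of PMAP$_\epsilon$, we calculate:

$$F(S_\epsilon; v_i) \ge (1 - \epsilon) \max_{S \in \mathcal{C}_k} F(S; v_i) \Rightarrow \tag{45}$$

$$\|v_i\|_2^2 - \|\gamma_i - v_i\|_2^2 \ge (1 - \epsilon)\left(\|v_i\|_2^2 - \|\gamma_i^{\text{opt}} - v_i\|_2^2\right) \Rightarrow \tag{46}$$

$$\|\gamma_i - v_i\|_2^2 \le (1 - \epsilon)\|\gamma_i^{\text{opt}} - v_i\|_2^2 + \epsilon\|v_i\|_2^2 \tag{47}$$

$$\le (1 - \epsilon)\|(v_i - x^*)_{S_i}\|_2^2 + \epsilon\|v_i\|_2^2. \tag{48}$$

■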



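Lemma 10. [De-bias] Let $\Gamma_i$ be the support set of a proxy vector $\gamma_i$ where $|\Gamma_i| \le k$. Then, the least squares solution $x_{i+1}$ given by:

$$x_{i+1} \leftarrow \arg\min_{v : \|v\|_1 \le \lambda,\ \mathrm{supp}(v) \in \Gamma_i} \|y - \Phi v\|_2^2, \tag{49}$$

satisfies:

$$\|x_{i+1} - x^*\|_2 \le \frac{1}{\sqrt{1 - \delta_{2k}^2}} \|\gamma_i - x^*\|_2 + \frac{\sqrt{1 + \delta_k}}{1 - \delta_{2k}} \|\varepsilon\|_2. \tag{50}$$

Proof: The proof is similar to the proof of the greedy descent step. Starting from $\|x_{i+1} - x^*\|_2^2$:

$$\|x_{i+1} - x^*\|_2^2 = \|(x_{i+1} - x^*)_{\Gamma_i}\|_2^2 + \|(x_{i+1} - x^*)_{\Gamma_i^c}\|_2^2. \tag{51}$$

Similarly to Lemma 8, $x_{i+1}$ is the minimizer of $\|y - \Phi x\|_2^2$ under support set and norm constraints and therefore:

$$\langle \nabla f(x_{i+1}), (x^* - x_{i+1})_{\Gamma_i} \rangle \ge 0 \Rightarrow \langle \Phi x_{i+1} - y, \Phi(x_{i+1} - x^*)_{\Gamma_i} \rangle \le 0. \tag{52}$$

Following the same procedure, we have:

$$\|(x_{i+1} - x^*)_{\Gamma_i}\|_2^2 = \langle x_{i+1} - x^*, (x_{i+1} - x^*)_{\Gamma_i} \rangle \tag{53}$$

$$\le \langle x_{i+1} - x^*, (x_{i+1} - x^*)_{\Gamma_i} \rangle - \langle \Phi x_{i+1} - y, \Phi(x_{i+1} - x^*)_{\Gamma_i} \rangle \tag{54}$$

$$= \langle x_{i+1} - x^*, (x_{i+1} - x^*)_{\Gamma_i} \rangle - \langle \Phi x_{i+1} - \Phi x^* - \varepsilon, \Phi(x_{i+1} - x^*)_{\Gamma_i} \rangle \tag{55}$$

$$= \langle x_{i+1} - x^*, (x_{i+1} - x^*)_{\Gamma_i} \rangle - \langle x_{i+1} - x^*, \Phi^*\Phi(x_{i+1} - x^*)_{\Gamma_i} \rangle + \langle \varepsilon, \Phi(x_{i+1} - x^*)_{\Gamma_i} \rangle \tag{56}$$

$$= \langle x_{i+1} - x^*, (I - \Phi^*\Phi)(x_{i+1} - x^*)_{\Gamma_i} \rangle + \langle \varepsilon, \Phi(x_{i+1} - x^*)_{\Gamma_i} \rangle \tag{57}$$

$$\le \left|\langle x_{i+1} - x^*, (I - \Phi^*\Phi)(x_{i+1} - x^*)_{\Gamma_i} \rangle\right| + \langle \varepsilon, \Phi(x_{i+1} - x^*)_{\Gamma_i} \rangle \tag{58}$$

$$\overset{(i)}{\le} \delta_{2k}\|(x_{i+1} - x^*)_{\Gamma_i}\|_2 \|x_{i+1} - x^*\|_2 + \sqrt{1 + \delta_k}\,\|(x_{i+1} - x^*)_{\Gamma_i}\|_2 \|\varepsilon\|_2, \tag{59}$$

where (i) is due to the Cauchy-Schwarz inequality and the isometry constant definition. Simplifying the above quadratic expression, we obtain

$$\|(x_{i+1} - x^*)_{\Gamma_i}\|_2 \le \delta_{2k}\|x_{i+1} - x^*\|_2 + \sqrt{1 + \delta_k}\,\|\varepsilon\|_2. \tag{60}$$

Thus, $\|x_{i+1} - x^*\|_2^2$ in (51) can be upper bounded by the quadratic expression:

$$\|x_{i+1} - x^*\|_2^2 \le \left(\delta_{2k}\|x_{i+1} - x^*\|_2 + \sqrt{1 + \delta_k}\,\|\varepsilon\|_2\right)^2 + \|(x_{i+1} - x^*)_{\Gamma_i^c}\|_2^2. \tag{61}$$

As in Lemma 8, we form a quadratic polynomial from (61) and bound $\|x_{i+1} - x^*\|_2$ by the largest root. Thus, we obtain:

$$\|x_{i+1} - x^*\|_2 \le \frac{1}{\sqrt{1 - \delta_{2k}^2}} \|(x_{i+1} - x^*)_{\Gamma_i^c}\|_2 + \frac{\sqrt{1 + \delta_k}}{1 - \delta_{2k}} \|\varepsilon\|_2. \tag{62}$$

In addition, we observe:

$$\|(x_{i+1} - x^*)_{\Gamma_i^c}\|_2 = \|(\gamma_i - x^*)_{\Gamma_i^c}\|_2 \le \|\gamma_i - x^*\|_2, \tag{63}$$

and thus:

$$\|x_{i+1} - x^*\|_2 \le \frac{1}{\sqrt{1 - \delta_{2k}^2}} \|\gamma_i - x^*\|_2 + \frac{\sqrt{1 + \delta_k}}{1 - \delta_{2k}} \|\varepsilon\|_2. \tag{64}$$

■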

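Lemma 11. Let $v_i$ be the least squares solution of the Greedy descent step given by

$$v_i \leftarrow \arg\min_{v:\ \|v\|_1 \le \lambda,\ \operatorname{supp}(v) \subseteq \mathcal{S}_i} \|y - \Phi v\|_2^2, \tag{65}$$

and $\gamma_i$ be a proxy vector to $v_i$ after applying Combinatorial selection and Least absolute shrinkage steps. Then, $\|\gamma_i - x^*\|_2$ can be expressed in terms of the distance from $v_i$ to $x^*$ as follows:

$$\|\gamma_i - x^*\|_2 \le \sqrt{1 + \left((1-\epsilon) + 2\sqrt{1-\epsilon}\right)\delta_{3k}^2 + 2\delta_{3k}\sqrt{\epsilon} + \epsilon}\,\cdot\,\|v_i - x^*\|_2 + D_1\|\varepsilon\|_2 + D_2\|x^*\|_2 + D_3\sqrt{\|x^*\|_2\|\varepsilon\|_2}, \tag{66}$$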
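where $D_1$, $D_2$, $D_3$ are constants depending on $\epsilon$, $\delta_{2k}$, $\delta_{3k}$.

Proof: We observe the following:

\begin{align}
\|\gamma_i - x^*\|_2^2 &= \|(\gamma_i - v_i) + (v_i - x^*)\|_2^2 \tag{67} \\
&= \|(v_i - x^*) - (v_i - \gamma_i)\|_2^2 \tag{68} \\
&= \|v_i - x^*\|_2^2 + \|v_i - \gamma_i\|_2^2 - 2\langle v_i - x^*, v_i - \gamma_i \rangle. \tag{69}
\end{align}

Focusing on the right hand side of expression (69), $\langle v_i - x^*, v_i - \gamma_i \rangle = \langle v_i - x^*, (v_i - \gamma_i)_{\mathcal{S}_i} \rangle$ can be analysed similarly to (31)-(37), where we obtain the following expression:

$$|\langle v_i - x^*, (v_i - \gamma_i)_{\mathcal{S}_i} \rangle| \le \delta_{3k}\,\|v_i - x^*\|_2\,\|v_i - \gamma_i\|_2 + \sqrt{1+\delta_{2k}}\,\|v_i - \gamma_i\|_2\,\|\varepsilon\|_2. \tag{70}$$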



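Now, expression (69) can be further transformed as:

\begin{align}
\|\gamma_i - x^*\|_2^2 &= \|v_i - x^*\|_2^2 + \|v_i - \gamma_i\|_2^2 - 2\langle v_i - x^*, v_i - \gamma_i \rangle \tag{71} \\
&\le \|v_i - x^*\|_2^2 + \|v_i - \gamma_i\|_2^2 + 2|\langle v_i - x^*, v_i - \gamma_i \rangle| \tag{72} \\
&\overset{(i)}{\le} \|v_i - x^*\|_2^2 + \|v_i - \gamma_i\|_2^2 + 2\left(\delta_{3k}\|v_i - x^*\|_2\|v_i - \gamma_i\|_2 + \sqrt{1+\delta_{2k}}\,\|v_i - \gamma_i\|_2\|\varepsilon\|_2\right) \tag{73} \\
&\overset{(ii)}{\le} \|v_i - x^*\|_2^2 + (1-\epsilon)\|\gamma_{\Gamma} - v_i\|_2^2 + \epsilon\|v_i\|_2^2 \notag \\
&\quad + 2\left(\delta_{3k}\|v_i - x^*\|_2 + \sqrt{1+\delta_{2k}}\,\|\varepsilon\|_2\right)\sqrt{(1-\epsilon)\|\gamma_{\Gamma} - v_i\|_2^2 + \epsilon\|v_i\|_2^2}, \tag{74}
\end{align}

where (i) is due to (70) and (ii) follows from the earlier lemma bounding $\|v_i - \gamma_i\|_2$. Given that $\sqrt{a^2 + b^2} \le a + b$ for $a, b \ge 0$, we further have:

\begin{align}
\|\gamma_i - x^*\|_2^2 &\le \|v_i - x^*\|_2^2 + (1-\epsilon)\|\gamma_{\Gamma} - v_i\|_2^2 + \epsilon\|v_i\|_2^2 \notag \\
&\quad + 2\delta_{3k}\|v_i - x^*\|_2\left(\sqrt{1-\epsilon}\,\|\gamma_{\Gamma} - v_i\|_2 + \sqrt{\epsilon}\,\|v_i\|_2\right) \notag \\
&\quad + 2\sqrt{1+\delta_{2k}}\left(\sqrt{1-\epsilon}\,\|\gamma_{\Gamma} - v_i\|_2 + \sqrt{\epsilon}\,\|v_i\|_2\right)\|\varepsilon\|_2 \tag{75} \\
&\overset{(i)}{\le} \|v_i - x^*\|_2^2 + (1-\epsilon)\|(v_i - x^*)_{\mathcal{S}_i}\|_2^2 + \epsilon\|v_i\|_2^2 \notag \\
&\quad + 2\delta_{3k}\|v_i - x^*\|_2\left(\sqrt{1-\epsilon}\,\|(v_i - x^*)_{\mathcal{S}_i}\|_2 + \sqrt{\epsilon}\,\|v_i\|_2\right) \notag \\
&\quad + 2\sqrt{1+\delta_{2k}}\left(\sqrt{1-\epsilon}\,\|(v_i - x^*)_{\mathcal{S}_i}\|_2 + \sqrt{\epsilon}\,\|v_i\|_2\right)\|\varepsilon\|_2 \tag{76} \\
&\overset{(ii)}{\le} \|v_i - x^*\|_2^2 + (1-\epsilon)\left(\delta_{3k}\|v_i - x^*\|_2 + \sqrt{1+\delta_{2k}}\,\|\varepsilon\|_2\right)^2 + \epsilon\|v_i\|_2^2 \notag \\
&\quad + 2\delta_{3k}\|v_i - x^*\|_2\left(\sqrt{1-\epsilon}\left(\delta_{3k}\|v_i - x^*\|_2 + \sqrt{1+\delta_{2k}}\,\|\varepsilon\|_2\right) + \sqrt{\epsilon}\,\|v_i\|_2\right) \notag \\
&\quad + 2\sqrt{1+\delta_{2k}}\left(\sqrt{1-\epsilon}\left(\delta_{3k}\|v_i - x^*\|_2 + \sqrt{1+\delta_{2k}}\,\|\varepsilon\|_2\right) + \sqrt{\epsilon}\,\|v_i\|_2\right)\|\varepsilon\|_2, \tag{77}
\end{align}

where (i) is due to (43) and (ii) is due to (38). Applying basic algebra on the right hand side of (77), we get:

\begin{align}
\|\gamma_i - x^*\|_2^2 &\le \left(1 + \left((1-\epsilon) + 2\sqrt{1-\epsilon}\right)\delta_{3k}^2\right)\|v_i - x^*\|_2^2 \notag \\
&\quad + \left(2(1-\epsilon)\delta_{3k}\sqrt{1+\delta_{2k}} + 4\delta_{3k}\sqrt{1-\epsilon}\sqrt{1+\delta_{2k}}\right)\|v_i - x^*\|_2\|\varepsilon\|_2 \notag \\
&\quad + \left((1-\epsilon)(1+\delta_{2k}) + 2(1+\delta_{2k})\sqrt{1-\epsilon}\right)\|\varepsilon\|_2^2 \notag \\
&\quad + 2\delta_{3k}\sqrt{\epsilon}\,\|v_i - x^*\|_2\|v_i\|_2 + 2\sqrt{\epsilon(1+\delta_{2k})}\,\|v_i\|_2\|\varepsilon\|_2 + \epsilon\|v_i\|_2^2 \tag{78} \\
&\overset{(i)}{\le} \left(\sqrt{1 + \left((1-\epsilon) + 2\sqrt{1-\epsilon}\right)\delta_{3k}^2}\,\|v_i - x^*\|_2 + \sqrt{\left((1-\epsilon) + 2\sqrt{1-\epsilon}\right)(1+\delta_{2k})}\,\|\varepsilon\|_2\right)^2 \notag \\
&\quad + 2\delta_{3k}\sqrt{\epsilon}\,\|v_i - x^*\|_2\|v_i\|_2 + 2\sqrt{\epsilon(1+\delta_{2k})}\,\|v_i\|_2\|\varepsilon\|_2 + \epsilon\|v_i\|_2^2, \tag{79}
\end{align}

where (i) is obtained by completing the squares and eliminating negative terms in (78). Using the triangle inequality, we know that:

$$\|v_i\|_2 \le \|v_i - x^*\|_2 + \|x^*\|_2, \tag{80}$$

and, thus, (79) can be further analyzed as:

\begin{align}
\|\gamma_i - x^*\|_2^2 &\le \left(\sqrt{1 + \left((1-\epsilon) + 2\sqrt{1-\epsilon}\right)\delta_{3k}^2}\,\|v_i - x^*\|_2 + \sqrt{\left((1-\epsilon) + 2\sqrt{1-\epsilon}\right)(1+\delta_{2k})}\,\|\varepsilon\|_2\right)^2 \notag \\
&\quad + \left(2\delta_{3k}\sqrt{\epsilon} + \epsilon\right)\|v_i - x^*\|_2^2 + \left(2\delta_{3k}\sqrt{\epsilon}\,\|x^*\|_2 + 2\sqrt{\epsilon(1+\delta_{2k})}\,\|\varepsilon\|_2 + 2\epsilon\|x^*\|_2\right)\|v_i - x^*\|_2 \notag \\
&\quad + 2\sqrt{\epsilon(1+\delta_{2k})}\,\|x^*\|_2\|\varepsilon\|_2 + \epsilon\|x^*\|_2^2. \tag{81}
\end{align}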
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After tedious computations, we end up with the following inequality: 



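$$\|\gamma_i - x^*\|_2 \le \sqrt{1 + \left((1-\epsilon) + 2\sqrt{1-\epsilon}\right)\delta_{3k}^2 + 2\delta_{3k}\sqrt{\epsilon} + \epsilon}\,\cdot\,\|v_i - x^*\|_2 + D_1\|\varepsilon\|_2 + D_2\|x^*\|_2 + D_3\sqrt{\|x^*\|_2\|\varepsilon\|_2}, \tag{82}$$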

where 

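$$D_1 \triangleq \frac{\sqrt{1 + \left((1-\epsilon) + 2\sqrt{1-\epsilon}\right)\delta_{3k}^2}\,\sqrt{\left((1-\epsilon) + 2\sqrt{1-\epsilon}\right)(1+\delta_{2k})} + \sqrt{\epsilon(1+\delta_{2k})}}{\sqrt{1 + \left((1-\epsilon) + 2\sqrt{1-\epsilon}\right)\delta_{3k}^2 + 2\delta_{3k}\sqrt{\epsilon} + \epsilon}}, \tag{83}$$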



k + ((1 -e) + 2VT^5I, + 263kV^^ V l + {(l-e) + 2VT^)6l, + 2S3kV'e + e' 



$$D_3 \triangleq \sqrt{2\sqrt{\epsilon(1+\delta_{2k})}}. \tag{85}$$

■ 
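The "basic algebra" that groups the right hand side of (77) into the coefficients of (78) is easy to get wrong by hand, so a numerical spot check can help. The sketch below is an illustration only: it encodes both expressions as we read them from the proof, with hypothetical scalar arguments `a`, `e`, `v` standing for $\|v_i - x^*\|_2$, $\|\varepsilon\|_2$, $\|v_i\|_2$, and `d3`, `d2`, `eps` standing for $\delta_{3k}$, $\delta_{2k}$, $\epsilon$. The function names are not from the paper.

```python
import math


def rhs_77(a, e, v, d3, d2, eps):
    """Right hand side of (77) as read here, before grouping terms."""
    s = d3 * a + math.sqrt(1 + d2) * e  # bound on ||(v_i - x*)_{S_i}||_2 from (38)
    mix = math.sqrt(1 - eps) * s + math.sqrt(eps) * v
    return (a ** 2 + (1 - eps) * s ** 2 + eps * v ** 2
            + 2 * d3 * a * mix
            + 2 * math.sqrt(1 + d2) * mix * e)


def rhs_78(a, e, v, d3, d2, eps):
    """Right hand side of (78) as read here, grouped by monomial."""
    kappa = (1 - eps) + 2 * math.sqrt(1 - eps)
    root = math.sqrt(1 + d2)
    return ((1 + kappa * d3 ** 2) * a ** 2
            + (2 * (1 - eps) * d3 * root + 4 * d3 * math.sqrt(1 - eps) * root) * a * e
            + ((1 - eps) * (1 + d2) + 2 * (1 + d2) * math.sqrt(1 - eps)) * e ** 2
            + 2 * d3 * math.sqrt(eps) * a * v
            + 2 * math.sqrt(eps * (1 + d2)) * v * e
            + eps * v ** 2)


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    args = (1.0, 0.5, 2.0, 0.2, 0.1, 0.05)
    print(rhs_77(*args), rhs_78(*args))
```

Agreement of the two functions on arbitrary inputs indicates that the grouping is an identity, so the only inequalities in this part of the argument are the ones marked (i) and (ii).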
Using the above lemmas, we now complete the proof of Theorem 1. 



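Proof: Combining (28) with (66), we get:

$$\|\gamma_i - x^*\|_2 \le \sqrt{\frac{1 + \left((1-\epsilon) + 2\sqrt{1-\epsilon}\right)\delta_{3k}^2 + 2\delta_{3k}\sqrt{\epsilon} + \epsilon}{1-\delta_{3k}^2}}\;\|(v_i - x^*)_{\mathcal{S}_i^c}\|_2 + D_4\|\varepsilon\|_2 + D_2\|x^*\|_2 + D_3\sqrt{\|x^*\|_2\|\varepsilon\|_2}, \tag{86}$$

where

$$D_4 \triangleq D_1 + \frac{\sqrt{1+\delta_{2k}}}{1-\delta_{3k}}\sqrt{1 + \left((1-\epsilon) + 2\sqrt{1-\epsilon}\right)\delta_{3k}^2 + 2\delta_{3k}\sqrt{\epsilon} + \epsilon}. \tag{87}$$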
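We know that $\mathcal{X}_i \subseteq \mathcal{S}_i$. Thus, $(v_i)_{\mathcal{S}_i^c} = 0$ and $(x_i)_{\mathcal{S}_i^c} = 0$. Therefore,

$$\|(v_i - x^*)_{\mathcal{S}_i^c}\|_2 = \|(v_i)_{\mathcal{S}_i^c} - (x^*)_{\mathcal{S}_i^c}\|_2 = \|(x_i)_{\mathcal{S}_i^c} - (x^*)_{\mathcal{S}_i^c}\|_2 = \|(x_i - x^*)_{\mathcal{S}_i^c}\|_2. \tag{88}$$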



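Now, using (11), we form the following recursion:

$$\|\gamma_i - x^*\|_2 \le \sqrt{\frac{1 + \left((1-\epsilon) + 2\sqrt{1-\epsilon}\right)\delta_{3k}^2 + 2\delta_{3k}\sqrt{\epsilon} + \epsilon}{1-\delta_{3k}^2}}\,\left(\delta_{3k} + \delta_{2k} + \sqrt{\epsilon}\,(1+\delta_{2k})\right)\|x_i - x^*\|_2 + D_5\|\varepsilon\|_2 + D_2\|x^*\|_2 + D_3\sqrt{\|x^*\|_2\|\varepsilon\|_2}, \tag{89}$$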

where 

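Finally, substituting (89) in (50), we compute the desired recursive formula:

$$\frac{\|x_{i+1} - x^*\|_2}{\|x^*\|_2} \le \rho\,\frac{\|x_i - x^*\|_2}{\|x^*\|_2} + \frac{c_1(\delta_{2k}, \delta_{3k}, \epsilon)}{\mathrm{SNR}} + c_2(\delta_{2k}, \delta_{3k}, \epsilon) + c_3(\delta_{2k}, \delta_{3k}, \epsilon)\sqrt{\frac{1}{\mathrm{SNR}}}, \tag{91}$$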



where $\mathrm{SNR} = \dfrac{\|x^*\|_2}{\|\varepsilon\|_2}$ and



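$$\rho \triangleq \frac{\delta_{3k} + \delta_{2k} + \sqrt{\epsilon}\,(1+\delta_{2k})}{\sqrt{1-\delta_{2k}^2}}\,\sqrt{\frac{1 + \left((1-\epsilon) + 2\sqrt{1-\epsilon}\right)\delta_{3k}^2 + 2\delta_{3k}\sqrt{\epsilon} + \epsilon}{1-\delta_{3k}^2}}, \tag{92}$$

$$c_1(\delta_{2k}, \delta_{3k}, \epsilon) = \frac{D_5}{\sqrt{1-\delta_{2k}^2}} + \frac{\sqrt{1+\delta_k}}{1-\delta_{2k}}, \tag{93}$$

$$c_2(\delta_{2k}, \delta_{3k}, \epsilon) = \frac{D_2}{\sqrt{1-\delta_{2k}^2}}, \tag{94}$$

$$c_3(\delta_{2k}, \delta_{3k}, \epsilon) = \frac{D_3}{\sqrt{1-\delta_{2k}^2}}. \tag{95}$$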
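To see what a recursion of the form (91) implies over many iterations, one can unroll it numerically. The sketch below is a minimal illustration, not the authors' code: it treats $\rho$, $c_1$, $c_2$, $c_3$ and SNR as given numbers (the values in the demo are hypothetical), iterates the upper bound on the normalized error, and compares it with the fixed point of the affine map, which exists whenever $0 \le \rho < 1$.

```python
def iterate_error_bound(rho, c1, c2, c3, snr, e0, n_iter):
    """Unroll e_{i+1} = rho * e_i + c1/snr + c2 + c3 * sqrt(1/snr), starting at e0."""
    offset = c1 / snr + c2 + c3 * (1.0 / snr) ** 0.5
    errors = [e0]
    for _ in range(n_iter):
        errors.append(rho * errors[-1] + offset)
    return errors


def limiting_error_bound(rho, c1, c2, c3, snr):
    """Fixed point of the recursion; only meaningful for 0 <= rho < 1."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("the recursion contracts only when 0 <= rho < 1")
    offset = c1 / snr + c2 + c3 * (1.0 / snr) ** 0.5
    return offset / (1.0 - rho)


if __name__ == "__main__":
    # Hypothetical constants, chosen only for illustration.
    errs = iterate_error_bound(rho=0.5, c1=1.0, c2=0.0, c3=0.0, snr=10.0, e0=1.0, n_iter=60)
    print(errs[-1], limiting_error_bound(0.5, 1.0, 0.0, 0.0, 10.0))
```

With $c_2 = c_3 = 0$ the limit scales as $1/\mathrm{SNR}$, while nonzero $c_2$ contributes a floor that does not vanish as the noise decreases.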
Some of the techniques used in the proof of Theorem 1 borrow from Foucart's paper [21].
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